Improved LZP

paradox

Introduction

A good few Hugi issues back, there were some articles on improving LZP... so I thought I'd share a few ideas of my own. As well as the articles in Hugi, Dario Phong has some neat ideas for LZP (multiple contexts, continuous updating); see his homepage for details.

My ideas are entirely untested, so they might be a load of crap. Also, I'd say the resultant compressor would be significantly slower than the original LZP. However, it should still be acceptably fast, and it would use much less memory than typical PPM or CTW implementations.

LZP ideas

Ok, I wouldn't use the fast old way LZP finds occurrences of a particular context in the stream, since you get hash collisions. I'd either use a balanced binary search tree (a red-black tree, probably) or a linked hash table. That way we can have order-5 LZP if we want; order-5 is probably the highest you'd want to go.
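
Here's a minimal sketch of the linked hash table idea, with my own illustrative names (Entry, lookupAndUpdate and so on aren't from any real LZP source). Each bucket is a small linked list whose entries store the actual order-5 context bytes, so a hash collision can never hand back a position for the wrong context:

#include <stdlib.h>
#include <string.h>

#define ORDER      5                 /* order-5 LZP                        */
#define TABLE_BITS 16
#define TABLE_SIZE (1u << TABLE_BITS)

typedef struct Entry {
    unsigned char ctx[ORDER];        /* the literal context bytes          */
    long          pos;               /* most recent position after ctx     */
    struct Entry *next;              /* next entry in this bucket          */
} Entry;

static Entry *table[TABLE_SIZE];

static unsigned hashCtx(const unsigned char *ctx)
{
    unsigned h = 0;
    for (int i = 0; i < ORDER; i++)
        h = h * 131 + ctx[i];
    return h & (TABLE_SIZE - 1);
}

/* Return the last position seen after exactly this context (or -1 if it's
   new), then remember curPos as the most recent occurrence. */
long lookupAndUpdate(const unsigned char *ctx, long curPos)
{
    unsigned h = hashCtx(ctx);
    for (Entry *e = table[h]; e; e = e->next)
        if (memcmp(e->ctx, ctx, ORDER) == 0) {   /* exact match, not just equal hashes */
            long old = e->pos;
            e->pos = curPos;
            return old;
        }
    Entry *e = malloc(sizeof *e);                /* first time we see this context */
    memcpy(e->ctx, ctx, ORDER);
    e->pos  = curPos;
    e->next = table[h];
    table[h] = e;
    return -1;
}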

Anyway, ideas for making a higher performance implementation (with not too much loss of speed):

(i) Lots of contexts.
(ii) Better match encoding.
(iii) Exclusion for (i).

And some other things which will become apparent later.

(i)

If we use, say, 8 contexts for finding a match, then the most natural thing is to use a flat 3-bit code to say which one we used - but this goes against the whole idea of LZP: the most recent one will be used much more often. So we can use some kind of adaptive probability estimation technique, and then arithmetic coding, to say which one we have used. I'd bet up to about 50 contexts would be good, although it'd use more memory. With such a model we could say that the most recent context had been used in around a quarter of a bit.
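
A rough sketch of what that context-selection model could look like. I'm assuming some arithmetic coder behind an arithEncode(cumLow, freq, total) call; that interface, and the update increments, are guesses of mine rather than a tuned model:

#define NUM_CONTEXTS 8

static unsigned freq[NUM_CONTEXTS];   /* how often each slot has been the match */
static unsigned totalFreq;

/* assumed coder interface: encode the range [cumLow, cumLow+f) out of total */
extern void arithEncode(unsigned cumLow, unsigned f, unsigned total);

void initContextModel(void)           /* call once before coding anything  */
{
    for (int i = 0; i < NUM_CONTEXTS; i++) freq[i] = 1;   /* never zero    */
    totalFreq = NUM_CONTEXTS;
}

void codeContextIndex(int slot)       /* slot 0 = most recent context      */
{
    unsigned cumLow = 0;
    for (int i = 0; i < slot; i++) cumLow += freq[i];

    arithEncode(cumLow, freq[slot], totalFreq);

    freq[slot] += 32;                 /* favour slots that keep matching   */
    totalFreq  += 32;
    if (totalFreq > 65536) {          /* halve counts so the model adapts  */
        totalFreq = 0;
        for (int i = 0; i < NUM_CONTEXTS; i++) {
            freq[i] = (freq[i] >> 1) | 1;
            totalFreq += freq[i];
        }
    }
}

If slot 0 ends up holding most of the total count, its code length drops well below one bit, which is where the quarter-of-a-bit figure would come from.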

(ii)

Again, this method is undeveloped; there are heaps of different ways to write numbers out, but I think an adaptive scheme would work nicely. Something like this:

arithCode(matchLength > 5, aboveFiveProbability);   /* code the outcome of the test either way */
if(matchLength > 5) arithCode(matchLength > 10, aboveTenProbability);
...

Now, there are some points to notice. First, matchLength refers to the final match length, after parsing techniques have been completed (I'd try a good optimal-parsing heuristic). Second, the numbers '5' and '10' for the match-length tests are rather arbitrary; again, we should probably model them adaptively.
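
To make that concrete, here's a rough sketch with plain bit counters as the adaptive probabilities. arithEncodeBit() stands in for whatever binary arithmetic coder you'd actually use, and the thresholds are just the arbitrary ones from above:

#define NUM_THRESHOLDS 4
static const int threshold[NUM_THRESHOLDS] = { 5, 10, 20, 40 };

/* c0[i]/c1[i] count how often matchLength was at or below / above threshold[i] */
static unsigned c0[NUM_THRESHOLDS], c1[NUM_THRESHOLDS];

/* assumed coder interface: code one bit where P(bit==1) = p1 / (p0 + p1) */
extern void arithEncodeBit(int bit, unsigned p0, unsigned p1);

void codeMatchLength(int matchLength)
{
    for (int i = 0; i < NUM_THRESHOLDS; i++) {
        int above = matchLength > threshold[i];
        arithEncodeBit(above, c0[i] + 1, c1[i] + 1);   /* +1 keeps counts nonzero */
        if (above) c1[i]++; else c0[i]++;
        if (!above) break;        /* the decoder stops at the first 'no' as well  */
    }
    /* the exact length inside the final interval still has to be written,
       with a small flat or adaptive code - omitted here */
}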

(iii)

If we use, say, 20 contexts to find matches with, then we may experience correlations between some subset of the contexts themselves - in fact it is very probable, assuming LZP is a sensible technique. So we create a list of known unacceptable symbols. For example, suppose we have "ABCD" as a context and the symbols actually following it are "E...". If we have 4 other occurrences of "ABCD", 3 of them may be followed by "F..." and, say, the least recent one by "E...". As soon as we know "F" is wrong, we can exclude the candidates that predict it from the probability estimation, so that we get a shorter arithmetic code.
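
One way this could look in code. To keep the exclusion logic obvious, this sketch codes a yes/no flag per surviving candidate rather than a single arithmetic-coded index; data, candPos and arithCodeFlag are illustrative names, not a fixed design:

extern const unsigned char *data;                 /* the input buffer            */
extern void arithCodeFlag(int bit, int whichCandidate);  /* assumed coder hook   */

/* candPos: positions just after earlier occurrences of the current context,
   most recent first.  actual: the byte that really follows the context now.
   Returns the candidate we matched, or -1 if none (so a literal follows).   */
int codeCandidateWithExclusion(const long *candPos, int numCand,
                               unsigned char actual)
{
    unsigned char excluded[256] = { 0 };          /* known unacceptable symbols  */

    for (int i = 0; i < numCand; i++) {
        unsigned char pred = data[candPos[i]];    /* what this candidate predicts */
        if (excluded[pred])
            continue;                             /* already ruled out: costs nothing */
        int hit = (pred == actual);
        arithCodeFlag(hit, i);                    /* adaptively coded yes/no flag */
        if (hit)
            return i;
        excluded[pred] = 1;                       /* this byte is wrong from now on */
    }
    return -1;                                    /* no candidate matched, literal */
}

The decoder can mirror the same skipping, since it knows each candidate's predicted byte before it reads the flag.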

Conclusion

I think the above would work quite well. The probability estimation techniques are completely undeveloped, but they shouldn't be too much work. I'm not sure of the speed, but it should be pretty respectable, and hopefully the compression would be somewhere in the low 2 bits per byte. Anyway, tell me if you get anywhere with this, or if you have any questions/comments etc.

-- paradox